AITopics | data preparation

Collaborating Authors

data preparation

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

EPIC-KITCHENSVISORBenchmark VIdeoSegmentationsandObjectRelations-Appendix

Neural Information Processing SystemsFeb-9-2026, 04:29:17 GMT

Is it possible to identify individuals (i.e., one or more natural persons), either directly or indirectly(i.e.,incombinationwithotherdata)fromthedataset?

annotation, artificial intelligence, ifso, (18 more...)

Neural Information Processing Systems

Country: North America > United States > Michigan (0.04)

Technology: Information Technology > Artificial Intelligence (1.00)

Add feedback

Lost in the Pipeline: How Well Do Large Language Models Handle Data Preparation?

Spreafico, Matteo, Tassini, Ludovica, Sancricca, Camilla, Cappiello, Cinzia

arXiv.org Artificial IntelligenceDec-1-2025

Large language models have recently demonstrated their exceptional capabilities in supporting and automating various tasks. Among the tasks worth exploring for testing large language model capabilities, we considered data preparation, a critical yet often labor-intensive step in data-driven processes. This paper investigates whether large language models can effectively support users in selecting and automating data preparation tasks. To this aim, we considered both general-purpose and fine-tuned tabular large language models. We prompted these models with poor-quality datasets and measured their ability to perform tasks such as data profiling and cleaning. We also compare the support provided by large language models with that offered by traditional data preparation tools. To evaluate the capabilities of large language models, we developed a custom-designed quality model that has been validated through a user study to gain insights into practitioners' expectations.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2511.21708

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.49)

Add feedback

InstaGeo: Compute-Efficient Geospatial Machine Learning from Data to Deployment

Yusuf, Ibrahim Salihu, Houndayi, Iffanice, Oualha, Rym, Cherif, Mohamed Aziz, Panford-Quainoo, Kobby, Pretorius, Arnu

arXiv.org Artificial IntelligenceOct-8-2025

Open-access multispectral imagery from missions like Landsat 8-9 and Sentinel-2 has fueled the development of geospatial foundation models (GFMs) for humanitarian and environmental applications. Yet, their deployment remains limited by (i) the absence of automated geospatial data pipelines and (ii) the large size of fine-tuned models. Existing GFMs lack workflows for processing raw satellite imagery, and downstream adaptations often retain the full complexity of the original encoder. We present InstaGeo, an open-source, end-to-end framework that addresses these challenges by integrating: (1) automated data curation to transform raw imagery into model-ready datasets; (2) task-specific model distillation to derive compact, compute-efficient models; and (3) seamless deployment as interactive web-map applications. Using InstaGeo, we reproduced datasets from three published studies and trained models with marginal mIoU differences of -0.73 pp for flood mapping, -0.20 pp for crop segmentation, and +1.79 pp for desert locust prediction. The distilled models are up to 8x smaller than standard fine-tuned counterparts, reducing FLOPs and CO2 emissions with minimal accuracy loss. Leveraging InstaGeo's streamlined data pipeline, we also curated a larger crop segmentation dataset, achieving a state-of-the-art mIoU of 60.65%, a 12 pp improvement over prior baselines. Moreover, InstaGeo enables users to progress from raw data to model deployment within a single working day. By unifying data preparation, model compression, and deployment, InstaGeo transforms research-grade GFMs into practical, low-carbon tools for real-time, large-scale Earth observation. This approach shifts geospatial AI toward data quality and application-driven innovation. Source code, datasets, and model checkpoints are available at: https://github.com/instadeepai/InstaGeo-E2E-Geospatial-ML.git

data quality, instageo, machine learning, (13 more...)

arXiv.org Artificial Intelligence

2510.05617

Genre: Research Report (0.52)

Industry: Energy > Renewable > Geothermal > Geothermal Energy Exploration and Development > Geophysical Analysis & Survey (0.34)

Technology:

Information Technology > Data Science > Data Quality (0.54)
Information Technology > Artificial Intelligence > Machine Learning (0.51)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.34)

Add feedback

Automated Generation of Research Workflows from Academic Papers: A Full-text Mining Framework

Zhang, Heng, Zhang, Chengzhi

arXiv.org Artificial IntelligenceSep-24-2025

The automated generation of research workflows is essential for improving the reproducibility of research and accelerating the paradigm of "AI for Science". However, existing methods typically extract merely fragmented procedural components and thus fail to capture complete research workflows. To address this gap, we propose an end-to-end framework that generates comprehensive, structured research workflows by mining full-text academic papers. As a case study in the Natural Language Processing (NLP) domain, our paragraph-centric approach first employs Positive-Unlabeled (PU) Learning with SciBERT to identify workflow-descriptive paragraphs, achieving an F1-score of 0.9772. Subsequently, we utilize Flan-T5 with prompt learning to generate workflow phrases from these paragraphs, yielding ROUGE-1, ROUGE-2, and ROUGE-L scores of 0.4543, 0.2877, and 0.4427, respectively. These phrases are then systematically categorized into data preparation, data processing, and data analysis stages using ChatGPT with few-shot learning, achieving a classification precision of 0.958. By mapping categorized phrases to their document locations in the documents, we finally generate readable visual flowcharts of the entire research workflows. This approach facilitates the analysis of workflows derived from an NLP corpus and reveals key methodological shifts over the past two decades, including the increasing emphasis on data analysis and the transition from feature engineering to ablation studies. Our work offers a validated technical framework for automated workflow generation, along with a novel, process-oriented perspective for the empirical investigation of evolving scientific paradigms. Source code and data are available at: https://github.com/ZH-heng/research_workflow.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2509.12955

Country: Asia > China (0.28)

Genre:

Workflow (1.00)
Research Report > New Finding (1.00)

Industry:

Health & Medicine (0.46)
Information Technology > Software (0.35)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Empowering Tabular Data Preparation with Language Models: Why and How?

Chen, Mengshi, Sun, Yuxiang, Li, Tengchao, Wang, Jianwei, Wang, Kai, Lin, Xuemin, Zhang, Ying, Zhang, Wenjie

arXiv.org Artificial IntelligenceAug-5-2025

Data preparation is a critical step in enhancing the usability of tabular data and thus boosts downstream data-driven tasks. Traditional methods often face challenges in capturing the intricate relationships within tables and adapting to the tasks involved. Recent advances in Language Models (LMs), especially in Large Language Models (LLMs), offer new opportunities to automate and support tabular data preparation. However, why LMs suit tabular data preparation (i.e., how their capabilities match task demands) and how to use them effectively across phases still remain to be systematically explored. In this survey, we systematically analyze the role of LMs in enhancing tabular data preparation processes, focusing on four core phases: data acquisition, integration, cleaning, and transformation. For each phase, we present an integrated analysis of how LMs can be combined with other components for different preparation tasks, highlight key advancements, and outline prospective pipelines.

data mining, large language model, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2508.01556

Country:

North America > United States (0.93)
Europe (0.93)
Asia > Middle East > UAE (0.28)

Genre:

Research Report (1.00)
Overview (1.00)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Add feedback

LLaPipe: LLM-Guided Reinforcement Learning for Automated Data Preparation Pipeline Construction

Chang, Jing, Liu, Chang, Huang, Jinbin, Mao, Rui, Qin, Jianbin

arXiv.org Artificial IntelligenceJul-21-2025

Automated data preparation is crucial for democratizing machine learning, yet existing reinforcement learning (RL) based approaches suffer from inefficient exploration in the vast space of possible preprocessing pipelines. We present LLaPipe, a novel framework that addresses this exploration bottleneck by integrating Large Language Models (LLMs) as intelligent policy advisors. Unlike traditional methods that rely solely on statistical features and blind trial-and-error, LLaPipe leverages the semantic understanding capabilities of LLMs to provide contextually relevant exploration guidance. Our framework introduces three key innovations: (1) an LLM Policy Advisor that analyzes dataset semantics and pipeline history to suggest promising preprocessing operations, (2) an Experience Distillation mechanism that mines successful patterns from past pipelines and transfers this knowledge to guide future exploration, and (3) an Adaptive Advisor Triggering strategy (Advisor\textsuperscript{+}) that dynamically determines when LLM intervention is most beneficial, balancing exploration effectiveness with computational cost. Through extensive experiments on 18 diverse datasets spanning multiple domains, we demonstrate that LLaPipe achieves up to 22.4\% improvement in pipeline quality and 2.3$\times$ faster convergence compared to state-of-the-art RL-based methods, while maintaining computational efficiency through selective LLM usage (averaging only 19.0\% of total exploration steps).

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2507.13712

Country: Asia > China > Guangdong Province > Shenzhen (0.04)

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

Preference-based learning for news headline recommendation

Bouras, Alexandre, Durand, Audrey, Khoury, Richard

arXiv.org Artificial IntelligenceJun-10-2025

This study explores strategies for optimizing news headline recommendations through preference-based learning. Using real-world data of user interactions with French-language online news posts, we learn a headline recommender agent under a contextual bandit setting. This allows us to explore the impact of translation on engagement predictions, as well as the benefits of different interactive strategies on user engagement during data collection. Our results show that explicit exploration may not be required in the presence of noisy contexts, opening the door to simpler but efficient strategies in practice.

data mining, machine learning, user engagement, (17 more...)

arXiv.org Artificial Intelligence

2506.06334

Genre: Research Report > New Finding (0.55)

Industry: Media > News (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Data Science > Data Mining (0.96)
Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (0.67)

Add feedback

FedTDP: A Privacy-Preserving and Unified Framework for Trajectory Data Preparation via Federated Learning

Zeng, Zhihao, Fang, Ziquan, Shao, Wei, Chen, Lu, Gao, Yunjun

arXiv.org Artificial IntelligenceMay-9-2025

Trajectory data, which capture the movement patterns of people and vehicles over time and space, are crucial for applications like traffic optimization and urban planning. However, issues such as noise and incompleteness often compromise data quality, leading to inaccurate trajectory analyses and limiting the potential of these applications. While Trajectory Data Preparation (TDP) can enhance data quality, existing methods suffer from two key limitations: (i) they do not address data privacy concerns, particularly in federated settings where trajectory data sharing is prohibited, and (ii) they typically design task-specific models that lack generalizability across diverse TDP scenarios. To overcome these challenges, we propose FedTDP, a privacy-preserving and unified framework that leverages the capabilities of Large Language Models (LLMs) for TDP in federated environments. Specifically, we: (i) design a trajectory privacy autoencoder to secure data transmission and protect privacy, (ii) introduce a trajectory knowledge enhancer to improve model learning of TDP-related knowledge, enabling the development of TDP-oriented LLMs, and (iii) propose federated parallel optimization to enhance training efficiency by reducing data transmission and enabling parallel model training. Experiments on 6 real datasets and 10 mainstream TDP tasks demonstrate that FedTDP consistently outperforms 13 state-of-the-art baselines.

data mining, large language model, machine learning, (15 more...)

arXiv.org Artificial Intelligence

2505.05155

Country:

North America > United States (0.14)
Asia > China > Beijing > Beijing (0.05)
Asia > China > Zhejiang Province (0.04)

Genre: Research Report (0.50)

Industry:

Information Technology > Security & Privacy (1.00)
Transportation > Ground > Road (0.46)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback

Comparing Methods for Bias Mitigation in Graph Neural Networks

Hoffmann, Barbara, Mayer, Ruben

arXiv.org Artificial IntelligenceMar-28-2025

This paper examines the critical role of Graph Neural Networks (GNNs) in data preparation for generative artificial intelligence (GenAI) systems, with a particular focus on addressing and mitigating biases. We present a comparative analysis of three distinct methods for bias mitigation: data sparsification, feature modification, and synthetic data augmentation. Through experimental analysis using the german credit dataset, we evaluate these approaches using multiple fairness metrics, including statistical parity, equality of opportunity, and false positive rates. Our research demonstrates that while all methods improve fairness metrics compared to the original dataset, stratified sampling and synthetic data augmentation using GraphSAGE prove particularly effective in balancing demographic representation while maintaining model performance. The results provide practical insights for developing more equitable AI systems while maintaining model performance.

artificial intelligence, deep learning, machine learning, (15 more...)

arXiv.org Artificial Intelligence

2503.22569

Country: Europe > Germany > Bavaria > Upper Franconia > Bayreuth (0.05)

Genre: Research Report (0.84)

Industry: Health & Medicine (0.49)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.35)

Add feedback

Reinforced Large Language Model is a formal theorem prover

Luo, Zhiling

arXiv.org Artificial IntelligenceFeb-12-2025

Theorem formalization and proof are fundamental processes in mathematics and computer science that involve expressing mathematical statements in a precise, unambiguous language and rigorously demonstrating their truth or validity. Formalization typically begins by translating an informal mathematical theorem into a formal logical system, where every term, assumption, and conclusion is explicitly defined using axioms, rules of inference, and symbolic notation. This ensures that the statement is free from ambiguity and can be manipulated systematically. Once formalized, the proof process involves constructing a sequence of logically valid steps, starting from agreed-upon axioms or previously proven theorems, to establish the desired result. Proofs can range from simple direct arguments to complex constructions involving advanced techniques like induction, contradiction, or model theory.

large language model, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2502.08908

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.05)
Europe > Germany > Bavaria > Upper Bavaria > Munich (0.05)

Genre: Research Report (0.84)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.72)

Add feedback